An Asymptotically Optimal UCB Policy for Uniform Bandits of Unknown Support

Authors

  • Wesley Cowan
  • Michael N. Katehakis
Abstract

Consider the problem of a controller sampling sequentially from a finite number N ≥ 2 of populations, specified by random variables X^i_k, i = 1, . . . , N and k = 1, 2, . . ., where X^i_k denotes the outcome from population i the k-th time it is sampled. It is assumed that for each fixed i, {X^i_k}_{k≥1} is a sequence of i.i.d. uniform random variables over some interval [a_i, b_i], with the support (i.e., a_i, b_i) unknown to the controller. The objective is to have a policy π for deciding, based on available data, from which of the N populations to sample at any time n = 1, 2, . . ., so as to maximize the expected sum of outcomes of n samples, or equivalently to minimize the regret due to lack of information about the parameters {a_i} and {b_i}. In this paper, we present a simple UCB-type policy that is asymptotically optimal. Additionally, finite-horizon regret bounds are given.
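The sampling model above can be sketched in simulation. The index rule used here (sample maximum plus a range-scaled exploration bonus that shrinks as an arm accumulates pulls) is an illustrative stand-in for a UCB-type index over uniform arms with unknown support, not necessarily the paper's exact formula; the function name, supports, and horizon are hypothetical.

```python
import random

def simulate_ucb_uniform(supports, horizon, init_pulls=3, seed=0):
    """Simulate a UCB-style policy for uniform bandits with unknown support.

    `supports` holds the true (a_i, b_i) intervals, hidden from the policy.
    The index below is an illustrative stand-in, NOT the paper's exact rule.
    """
    rng = random.Random(seed)
    N = len(supports)
    samples = [[] for _ in range(N)]
    total = 0.0

    def draw(i):
        nonlocal total
        a, b = supports[i]
        x = rng.uniform(a, b)
        samples[i].append(x)
        total += x

    # Initialization: pull each arm a few times so min/max are informative
    # (and so the exponent 2/(k-2) below is well defined, k >= 3).
    for i in range(N):
        for _ in range(init_pulls):
            draw(i)

    for n in range(N * init_pulls, horizon):
        def index(i):
            obs = samples[i]
            k = len(obs)
            lo, hi = min(obs), max(obs)
            # Optimistic estimate: observed maximum plus a bonus that
            # grows with n but shrinks as arm i accumulates samples.
            return hi + (hi - lo) * ((n + 1) ** (2.0 / (k - 2)) - 1.0) / 2.0
        draw(max(range(N), key=index))

    return total, [len(s) for s in samples]
```

With two arms supported on (0.0, 0.5) and (0.2, 1.0), the second arm has the higher mean, so after an initial exploration phase the policy concentrates its pulls there while still sampling the weaker arm often enough to keep its support estimate honest.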


Related articles

On Bayesian Upper Confidence Bounds for Bandit Problems

Stochastic bandit problems have been analyzed from two different perspectives: a frequentist view, where the parameter is a deterministic unknown quantity, and a Bayesian approach, where the parameter is drawn from a prior distribution. We show in this paper that methods derived from this second perspective prove optimal when evaluated using the frequentist cumulated regret as a measure of perf...


Lipschitz Bandits: Regret Lower Bound and Optimal Algorithms

We consider stochastic multi-armed bandit problems where the expected reward is a Lipschitz function of the arm, and where the set of arms is either discrete or continuous. For discrete Lipschitz bandits, we derive asymptotic problem specific lower bounds for the regret satisfied by any algorithm, and propose OSLB and CKL-UCB, two algorithms that efficiently exploit the Lipschitz structure of t...


Asymptotically optimal priority policies for indexable and non-indexable restless bandits

We study the asymptotic optimal control of multi-class restless bandits. A restless bandit is a controllable stochastic process whose state evolution depends on whether or not the bandit is made active. Since finding the optimal control is typically intractable, we propose a class of priority policies that are proved to be asymptotically optimal under a global attractor property an...


A minimax and asymptotically optimal algorithm for stochastic bandits

We propose the kl-UCB algorithm for regret minimization in stochastic bandit models with exponential families of distributions. We prove that it is simultaneously asymptotically optimal (in the sense of Lai and Robbins’ lower bound) and minimax optimal. This is the first algorithm proved to enjoy these two properties at the same time. This work thus merges two different lines of research with s...


Regret Analysis of the Finite-Horizon Gittins Index Strategy for Multi-Armed Bandits

I prove near-optimal frequentist regret guarantees for the finite-horizon Gittins index strategy for multi-armed bandits with Gaussian noise and prior. Along the way I derive finite-time bounds on the Gittins index that are asymptotically exact and may be of independent interest. I also discuss computational issues and present experimental results suggesting that a particular version of the Git...



Journal:
  • CoRR

Volume: abs/1505.01918

Pages: -

Publication year: 2015